Our team has chosen the Ames Housing dataset for our Data 606 final project. This is an extremely interesting dataset with many features, which makes it well suited for analysis. This dataset is well suited for regression analysis in addition to utilizing sampling techniques and data imputation methods.
The primary question that we intend to answer with this project is: Can we accurately predict the sale price of a house given variables describing the house? This is of interest to us as understanding what factors affect a home price is useful when purchasing and selling a home, which, most people will do at some time in their life and it is a major purchase. The specific data set is for the city of Ames, Iowa, so in that respect it is not especially interesting as we will probably never purchase or sell a home in Ames, but the techniques and methods we employ during our project will be useful in the future.
In this report we intend to explore this data set looking for relationships between the target variable and the predictors. There is a large number of initial features (80), with some categorical data as well as numerical data. Below we detail out each of the individual variables that comprise our data set.
| Name | Description | Type |
|---|---|---|
| SalePrice | The property’s sale price in dollars. This is the target variable that you’re trying to predict. | Numeric |
| MSSubClass | Identifies the type of dwelling involved in the sale. | Category - Nominal |
| MSZoning | Identifies the general zoning classification of the sale. | Category - Nominal |
| LotFrontage | Linear feet of street connected to property | Numeric |
| LotArea | Lot size in square feet | Numeric |
| Street | Type of road access | Category - Nominal |
| Alley | Type of alley access | Category - Nominal |
| LotShape | General shape of property | Category - Nominal |
| LandContour | Flatness of the property | Category - Nominal |
| Utilities | Type of utilities available | Category - Nominal |
| LotConfig | Lot configuration | Category - Nominal |
| LandSlope | Slope of property | Category - Nominal |
| Neighborhood | Physical locations within Ames city limits | Category - Nominal |
| Condition1 | Proximity to main road or railroad | Category - Nominal |
| Condition2 | Proximity to main road or railroad (if a second is present) | Category - Nominal |
| BldgType | Type of dwelling | Category - Nominal |
| HouseStyle | Style of dwelling | Category - Nominal |
| OverallQual | Overall material and finish quality | Category - Ordinal |
| OverallCond | Overall condition rating | Category - Ordinal |
| YearBuilt | Original construction date | Date |
| YearRemodAdd | Remodel date | Date |
| RoofStyle | Type of roof | Category - Nominal |
| RoofMatl | Roof material | Category - Nominal |
| Exterior1st | Exterior covering on house | Category - Nominal |
| Exterior2nd | Exterior covering on house (if more than one material) | Category - Nominal |
| MasVnrType | Masonry veneer type | Category - Nominal |
| MasVnrArea | Masonry veneer area in square feet | Numeric |
| ExterQual | Exterior material quality | Category - Nominal |
| ExterCond | Present condition of the material on the exterior | Category - Nominal |
| Foundation | Type of foundation | Category - Nominal |
| BsmtQual | Height of the basement | Category - Nominal |
| BsmtCond | General condition of the basement | Category - Nominal |
| BsmtExposure | Walkout or garden level basement walls | Category - Nominal |
| BsmtFinType1 | Quality of basement finished area | Category - Nominal |
| BsmtFinSF1 | Type 1 finished square feet | Numeric |
| BsmtFinType2 | Quality of second finished area (if present) | Category - Nominal |
| BsmtFinSF2 | Type 2 finished square feet | Numeric |
| BsmtUnfSF | Unfinished square feet of basement area | Numeric |
| TotalBsmtSF | Total square feet of basement area | Numeric |
| Heating | Type of heating | Category - Nominal |
| HeatingQC | Heating quality and condition | Category - Nominal |
| CentralAir | Central air conditioning | Category - Nominal |
| Electrical | Electrical system | Category - Nominal |
| 1stFlrSF | First Floor square feet | Numeric |
| 2ndFlrSF | Second floor square feet | Numeric |
| LowQualFinSF | Low quality finished square feet (all floors) | Numeric |
| GrLivArea | Above grade (ground) living area square feet | Numeric |
| BsmtFullBath | Basement full bathrooms | Numeric |
| BsmtHalfBath | Basement half bathrooms | Numeric |
| FullBath | Full bathrooms above grade | Numeric |
| HalfBath | Half baths above grade | Numeric |
| Bedroom | Number of bedrooms above basement level | Numeric |
| Kitchen | Number of kitchens | Numeric |
| KitchenQual | Kitchen quality | Category - Nominal |
| TotRmsAbvGrd | Total rooms above grade (does not include bathrooms) | Numeric |
| Functional | Home functionality rating | Category - Nominal |
| Fireplaces | Number of fireplaces | Numeric |
| FireplaceQu | Fireplace quality | Category - Nominal |
| GarageType | Garage location | Category - Nominal |
| GarageYrBlt | Year garage was built | Date |
| GarageFinish | Interior finish of the garage | Category - Nominal |
| GarageCars | Size of garage in car capacity | Numeric |
| GarageArea | Size of garage in square feet | Numeric |
| GarageQual | Garage quality | Category - Nominal |
| GarageCond | Garage condition | Category - Nominal |
| PavedDrive | Paved driveway | Category - Nominal |
| WoodDeckSF | Wood deck area in square feet | Numeric |
| OpenPorchSF | Open porch area in square feet | Numeric |
| EnclosedPorch | Enclosed porch area in square feet | Numeric |
| 3SsnPorch | Three season porch area in square feet | Numeric |
| ScreenPorch | Screen porch area in square feet | Numeric |
| PoolArea | Pool area in square feet | Numeric |
| PoolQC | Pool quality | Category - Nominal |
| Fence | Fence quality | Category - Nominal |
| MiscFeature | Miscellaneous feature not covered in other categories | Category - Nominal |
| MiscVal | $Value of miscellaneous feature | Category - Nominal |
| MoSold | Month Sold | Category - Nominal |
| YrSold | Year Sold | Category - Nominal |
| SaleType | Type of sale | Category - Nominal |
| SaleCondition | Condition of sale | Category - Nominal |
From the original dataset, the first step of the data cleaning process was to look at the features with missing values. A function was created and a count of the missing values was generated.
## # A tibble: 19 x 2
## Feature NAs
## <chr> <int>
## 1 LotFrontage 259
## 2 Alley 1369
## 3 MasVnrType 8
## 4 MasVnrArea 8
## 5 BsmtQual 37
## 6 BsmtCond 37
## 7 BsmtExposure 38
## 8 BsmtFinType1 37
## 9 BsmtFinType2 38
## 10 Electrical 1
## 11 FireplaceQu 690
## 12 GarageType 81
## 13 GarageYrBlt 81
## 14 GarageFinish 81
## 15 GarageQual 81
## 16 GarageCond 81
## 17 PoolQC 1453
## 18 Fence 1179
## 19 MiscFeature 1406
From the description of the dataset, we know that for a number of these features, “NA” is actually a true category that we can simply fill in. The following code chunk replaces the “NA” value with a meaningful category and returns the remaining missing values from the cleaned dataset.
house.clean = replace_cats(house.orig)
From here are left with 5 features that we need to explore further. This will be outlined in the “Imputation” section.
In order to further remove missing values from the dataset, there are 5 remaining variables that need to be explored further. The MiscFeature variable is missing 96.3% of the values, and upon reviewing the description of this category we have deemed it safe to simply remove this feature.
The following table outlines the counts of each category within the Electrical variable.
table(house.clean$Electrical)
##
## FuseA FuseF FuseP Mix SBrkr
## 94 27 3 1 1334
Based on this table, we see that one of the categories, “SBrkr”, which represents the standard electrical system of a typical house, dominates the other categories. As a result, we decided to choose the dominant category to impute into the one missing value. This is done in the following code chunk.
# Impute the mode of the categorical variable
house.clean = impute_cat_mode(house.clean, "Electrical")
The following table outlines the counts of each category within the masonry veneer type variable, MasVnrTypes.
table(house.clean$MasVnrType)
##
## BrkCmn BrkFace None Stone
## 15 445 864 128
From the above table, it is not clear if there is a straightforward way to impute values into the missing values. The following code chunk completes the following steps:
house.clean = replace_cats_prop(house.clean, 'MasVnrType')
The last two variables that require imputed values are “MasVnrArea” and “LotFrontage”. This was completed using the mice package.
house.clean = impute_values(house.clean, c("MasVnrArea", 'LotFrontage'))
The following section outlines the process behind removing outliers and cleaning up unnecessary columns from the dataset.
We removed 2 data points out of the set. These 2 points are when the above ground living around is above 4000 but the sale price is less than 200000.
The following section outlines the steps taken to remove, filter and combine columns in order to create a reduced dataframe with features that are more relevant to the housing sale price.
FullBath: Full bathrooms above grade HalfBath: Half baths above grade
The Above Ground bathrooms variables were combined into one variable thru combining Full Baths and Half Baths. They were further categorized due to the variance of the amount of bathrooms, say if a house had 3+ bathrooms, they were combined with any house that had 2.5+ Above Ground bathrooms. We will show the categorization below. The Below Ground bathrooms where calculated in a very similar manner
Our BathAbove variable starts with: \[ BathAbove = FullBath + 0.5 * HalfBath \]
Our BathBelow variable starts with: \[ BathAbove = BsmtFullBath + 0.5 * BsmtHalfBath \]
house.clean %>%
mutate(colorN = ifelse((FullBath + 0.5 * HalfBath) <= 1, "1",
ifelse((FullBath + 0.5 * HalfBath) >= 2.5, '2.5+', as.character(FullBath + 0.5 * HalfBath)))) %>%
ggplot(aes(y = (SalePrice), x = factor(FullBath + 0.5 * HalfBath), fill=factor(colorN) , color = factor(colorN))) +
geom_boxplot() +
scale_fill_manual(values=c("Red", "Blue", "Blue", "Green")) +
scale_alpha_manual(values=c("1", "0", "2","2.5+")) +
geom_point(position = position_jitter(width = 0.2, height = 0.1), alpha = 0.2)
house.clean %>%
mutate( colorN = ifelse((BsmtFullBath + 0.5 * BsmtFullBath) <= 1, "1",
ifelse( (BsmtFullBath + 0.5 * BsmtFullBath) >= 2.5, "2.5+", as.character( BsmtFullBath + 0.5 * BsmtFullBath )))) %>%
ggplot(aes(y = (SalePrice), x = factor(BsmtFullBath + 0.5 * BsmtFullBath), fill=factor(colorN) , color = factor(colorN))) +
geom_boxplot() +
scale_fill_manual(values=c("Red", "Blue", "Green", "Green")) +
scale_alpha_manual(values=c("0", "1", "2","2.5+")) +
geom_point(position = position_jitter(width = 0.2, height = 0.1), alpha = 0.2)
ggplot(house.clean, aes(y = SalePrice, x = WoodDeckSF, color = WoodDeckSF)) +
geom_point() +
geom_point(position = position_jitter(width = 0.2, height = 0.1), alpha = 0.2 )
house.clean %>%
filter(WoodDeckSF > 0) %>%
ggplot(aes(y = SalePrice, x = WoodDeckSF, color = WoodDeckSF)) +
geom_point() +
geom_point(position = position_jitter(width = 0.2, height = 0.1), alpha = 0.2) +
geom_smooth(method='lm', formula= y~x)
The type of sale variable, “SaleType”, has limited variation between categories and the majority of the data falls in a couple of categories, as outlined in the following chart.
house.clean %>%
ggplot(aes(x = SaleType, y = SalePrice)) +
scale_y_continuous(labels = scales::dollar) +
geom_boxplot() +
geom_point(position = 'jitter', alpha = 0.2)
Due to the limited variation between categories, this variable was reduced into three categories: New, Deed and Other. The following chart outlines a reasonable amount of variation between these new, combined categories.
house.clean %>%
mutate(SaleType = ifelse(SaleType == 'New',
"New",
ifelse(SaleType %in% c("WD", "CWD", "VWD", "COD"),
"Deed",
"Other"))) %>%
ggplot(aes(x = SaleType, y = SalePrice)) +
scale_y_continuous(labels = scales::dollar) +
geom_boxplot() +
geom_point(position = 'jitter', alpha = 0.2)
The sale condition variable, “SaleCondition”, has limited variation between categories and the majority of the data falls in a couple of categories, as outlined in the following chart.
house.clean %>%
ggplot(aes(x = SaleCondition, y = SalePrice)) +
geom_boxplot() +
geom_point(position = 'jitter', alpha = 0.2) +
scale_y_continuous(labels = scales::dollar)
Due to the limited variation between categories, this variable was reduced into two categories: Good and Poor. The original six categories did not have a suitable amount of variation and as a result, five of these categories were combined into the “Poor” category. The following chart outlines a reasonable amount of variation between these new, combined categories.
house.clean %>%
mutate(SaleCondition = ifelse(SaleCondition %in% c("Partial"),"Good", "Poor")) %>%
ggplot(aes(x = SaleCondition, y = SalePrice)) +
geom_boxplot() +
geom_point(position = 'jitter', alpha = 0.2) +
scale_y_continuous(labels = scales::dollar)
The overall condition variable, a categorical variable with values ranging from 1 to 9, has limited variation between categories as outlined in the following chart.
house.clean %>%
ggplot(aes(y = SalePrice, x = factor(OverallCond))) +
geom_boxplot() +
geom_point(position = 'jitter', alpha = 0.2) +
scale_y_continuous(labels = scales::dollar)
this variable was reduced into two categories: Good and Bad. The reduced category, “Good”, now represents overall condition with values greater than and equal to 5 while “Bad” represents values less than 5. The following chart outlines a reasonable amount of variation between these new, combined categories.
house.clean %>%
mutate(OverallCond = ifelse(OverallCond >= 5, "Good", "Bad")) %>%
ggplot(aes(y = SalePrice, x = OverallCond)) +
geom_boxplot() +
geom_point(position = 'jitter', alpha = 0.2) +
scale_y_continuous(labels = scales::dollar)
The type of foundation variable, “Foundation”, has limited variation between categories and the majority of the data falls in a couple of categories, as outlined in the following chart.
house.clean %>%
ggplot(aes(x = Foundation, y = SalePrice)) +
stat_boxplot() +
geom_point(position = 'jitter', alpha = 0.2) +
scale_y_continuous(labels = scales::dollar, limits = c(0, 300000))
## Warning: Removed 115 rows containing non-finite values (stat_boxplot).
## Warning: Removed 115 rows containing missing values (geom_point).
Due to the limited variation between categories, this variable was reduced into two categories: “PConc” and “Other”. The “PConc” category, which represents houses with poured concrete foundations, represented the majority of the data in this dataset and its distribution was different than all other categories, hence the reduction into two categories. The following chart outlines a reasonable amount of variation between these new, combined categories.
house.clean %>%
mutate(colorN = ifelse(Foundation == "PConc", Foundation, "Other")) %>%
ggplot(aes(y = (SalePrice), x = factor(Foundation), fill=factor(colorN) , color = factor(colorN))) +
geom_boxplot() +
scale_fill_manual(values=c("Red", "Blue" )) +
scale_alpha_manual(values=c("PConc", "Other" )) +
geom_point(position = position_jitter(width = 0.2, height = 0.1), alpha = 0.2)
The general zoning classification variable, “MSZoning”, has limited variation between categories and the majority of the data falls in a couple of categories, as outlined in the following chart.
house.clean %>%
ggplot(aes(x = MSZoning, y = SalePrice)) +
geom_boxplot() +
geom_point(position = 'jitter', alpha = 0.2) +
scale_y_continuous(labels = scales::dollar, limits = c(0, 300000))
## Warning: Removed 115 rows containing non-finite values (stat_boxplot).
## Warning: Removed 115 rows containing missing values (geom_point).
Due to the limited variation between categories, this variable was reduced into two categories: “RL” and “Other”. The “RL” category, which represents houses in low density residential zones, represented the majority of the data in this dataset and its distribution was different than all other categories, hence the reduction into two categories. The following chart outlines a reasonable amount of variation between these new, combined categories.
house.clean %>%
mutate(MSZoning = ifelse(MSZoning == "RL", MSZoning, "Other")) %>%
ggplot(aes(x = MSZoning, y = SalePrice)) +
geom_boxplot() +
geom_point(position = 'jitter', alpha = 0.2) +
scale_y_continuous(labels = scales::dollar, limits = c(0, 300000))
## Warning: Removed 115 rows containing non-finite values (stat_boxplot).
## Warning: Removed 115 rows containing missing values (geom_point).
The total rooms above grade variable, “TotRmsAbvGrd”, a categorical variable with values ranging up to 14, has limited variation between categories as outlined in the following chart.
house.clean %>%
mutate(colorN = ifelse(TotRmsAbvGrd >= 7, "High", "Low")) %>%
ggplot(aes(y = (SalePrice), x = factor(TotRmsAbvGrd), fill=factor(colorN) , color = factor(colorN))) +
geom_boxplot() +
scale_fill_manual(values=c("red", "blue")) +
scale_alpha_manual(values=c("High", "Low")) +
geom_point(position = position_jitter(width = 0.2, height = 0.1), alpha = 0.2) +
xlab("Total Rooms Above Grade") +
ylab("Sale Price") +
scale_y_continuous(labels = scales::dollar) +
theme(legend.title = element_blank())
Due to the limited variation between categories, was reduced into two categories: “High” and “Low”. The reduced category, “Good”, now represents overall condition with values greater than and equal to 5 while “Bad” represents values less than 5. The following chart outlines a reasonable amount of variation between these new, combined categories.
house.clean %>%
mutate(colorN = ifelse(Fireplaces > 0, TRUE, FALSE)) %>%
ggplot(aes(y = (SalePrice), x = factor(Fireplaces), fill=factor(colorN) , color = factor(colorN))) +
geom_boxplot() +
scale_fill_manual(values=c("Red", "Blue")) +
scale_alpha_manual(values=c(FALSE, TRUE)) +
geom_point(position = position_jitter(width = 0.2, height = 0.1), alpha = 0.2) +
xlab("Total Rooms Above Grade") +
ylab("Sale Price") +
scale_y_continuous(labels = scales::dollar) +
theme(legend.title = element_blank())
Fireplaces: Number of fireplaces
house.clean %>%
mutate( colorN = ifelse(Fireplaces > 0, TRUE, FALSE)) %>%
ggplot(aes(y = (SalePrice), x = factor(Fireplaces), fill=factor(colorN) , color = factor(colorN))) +
geom_boxplot() +
scale_fill_manual(values=c("Red", "Blue")) +
scale_alpha_manual(values=c(FALSE, TRUE)) +
geom_point(position = position_jitter(width = 0.2, height = 0.1), alpha = 0.2)
The Fireplace variable was shortened simply to houses that had or did not have a fireplace. There wasn’t enough variance between houses with more than 1 fireplace, nor are their considerably more data points for houses that have more than 1 fireplace.
house.clean %>%
mutate( colorN = ifelse(GarageCars < 3, GarageCars, 3)) %>%
ggplot(aes(y = log(SalePrice), x = factor(GarageCars), fill=factor(colorN) , color = factor(colorN))) +
geom_boxplot() +
scale_fill_manual(values=c("Red", "Blue", "Green", "Yellow" )) +
scale_alpha_manual(values=c("0", "1", "2", "3")) +
geom_point(position = position_jitter(width = 0.2, height = 0.1), alpha = 0.2)
The Garage Cars variables was re-categorized into an ordinal category variable. Because there were not many 4 Garage Car homes, they are combined into one category of 3+ Garage homes.
house.clean %>%
mutate(colorN = ifelse(SaleType == 'New',"New",
ifelse(SaleType %in% c("WD", "CWD", "VWD", "COD"), "Deed", "Other"))) %>%
ggplot(aes(y = log(SalePrice),
x = reorder(factor(SaleType), log(SalePrice), FUN = median),
fill=factor(colorN) , color = factor(colorN))) +
geom_boxplot() +
scale_fill_manual(values=c("Red", "Blue", "Green" )) +
scale_alpha_manual(values=c("New", "Deed", "Other" )) +
geom_point(position = position_jitter(width = 0.2, height = 0.1), alpha = 0.2)
When considering SaleType we wanted to boxplot the values to see the variance between each category, however when considering the meaning and context of each of the categories, we grouped based on their contextual meaning of each category, ultimately re-categorizing the variables as a ordinal variable, a New Sale, Deeds sale, and other.
When considering Basement Condition, our boxplot chart explains that there is no variance relative to Sale Price, between a Basement Condition that is Good and Typical, between Fair and No Basement, and a Poor Basement Condition. As such, they are re-categorized to a Good, Average and Bad categorization.
house.clean %>%
mutate(colorN = ifelse(BsmtCond %in% c("Gd", "TA"), "Good",
ifelse(BsmtCond %in% c("Fa", NA), "Ave", "Bad"))) %>%
ggplot(aes(y = (SalePrice), x = factor(BsmtCond), fill=factor(colorN) , color = factor(colorN))) +
geom_boxplot() +
scale_fill_manual(values=c("red", "black", "blue")) +
scale_alpha_manual(values=c("Good", "Ave", "Bad")) +
geom_point(position = position_jitter(width = 0.2, height = 0.1), alpha = 0.2)
house.clean %>%
mutate(colorN = ifelse(OverallCond >= 5, "Good", "Bad")) %>%
ggplot(aes(y = (SalePrice), x = factor(OverallCond), fill=factor(colorN) , color = factor(colorN))) +
geom_boxplot() +
scale_fill_manual(values=c("Red", "Blue" )) +
scale_alpha_manual(values=c("Good", "Bad" )) +
geom_point(position = position_jitter(width = 0.2, height = 0.1), alpha = 0.2)
house.clean %>%
mutate(colorN = ifelse(Heating %in% c("GasA", "GasW"), "Gas", "NoGas")) %>%
ggplot(aes(y = (SalePrice), x = factor(Heating), fill=factor(colorN) , color = factor(colorN))) +
geom_boxplot() +
scale_fill_manual(values=c("Red", "Blue" )) +
scale_alpha_manual(values=c("Gas", "NoGas" )) +
geom_point(position = position_jitter(width = 0.2, height = 0.1), alpha = 0.2)
house.clean %>%
mutate(colorN = ifelse(BsmtFinType1 %in% c("ALQ", "BLQ", "LwQ", "Rec", "Unf"), "Ave",
ifelse(BsmtFinType1 == "GLQ", "Good", "None"))) %>%
ggplot(aes(y = (SalePrice), x = factor(BsmtFinType1), fill=factor(colorN) , color = factor(colorN))) +
geom_boxplot() +
scale_fill_manual(values=c("red", "blue", "black")) +
scale_alpha_manual(values=c("Good", "Ave", "Bad")) +
geom_point(position = position_jitter(width = 0.2, height = 0.1), alpha = 0.2)
house.clean %>%
mutate(colorN = ifelse(OverallQual >= 9, "9+",
ifelse(OverallQual <= 2, "0-",
ifelse(OverallQual > 2 & OverallQual <= 4, "3-4", OverallQual)))) %>%
ggplot(aes(y = (SalePrice), x = factor(OverallQual), fill=factor(colorN) , color = factor(colorN))) +
geom_boxplot() +
scale_fill_manual(values=c("Red", "Blue","Green", "Green", "Green", "Green", "Yellow" )) +
scale_alpha_manual(values=c("5","6","7", "8", "3-4", "9+", "0-" )) +
geom_point(position = position_jitter(width = 0.2, height = 0.1), alpha = 0.2)
#Scheffe Test
CRD.OverallQual = aov(log(SalePrice) ~ OverallQual, data = house.clean)
summary(CRD.OverallQual)
## Df Sum Sq Mean Sq F value Pr(>F)
## OverallQual 1 155.46 155.46 2931 <2e-16 ***
## Residuals 1458 77.34 0.05
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Fishers LSD (Least Significant Difference)
library(agricolae)
ls.OveralQual = LSD.test(CRD.OverallQual, trt='OverallQual')
ls.OveralQual
## $statistics
## MSerror Df Mean CV
## 0.05304432 1458 12.02405 1.91544
##
## $parameters
## test p.ajusted name.t ntr alpha
## Fisher-LSD none OverallQual 10 0.05
##
## $means
## log(SalePrice) std r LCL UCL Min Max
## 1 10.79880 0.3108790 2 10.47935 11.11826 10.57898 11.01863
## 2 10.82538 0.3060823 3 10.56455 11.08622 10.47195 11.00210
## 3 11.33747 0.3067162 20 11.23645 11.43850 10.54271 11.84654
## 4 11.55715 0.2802953 116 11.51520 11.59910 10.46024 12.45293
## 5 11.78066 0.2110016 397 11.75798 11.80333 10.93298 12.34126
## 6 11.96731 0.2298875 374 11.94395 11.99067 11.23849 12.53177
## 7 12.22177 0.2108982 319 12.19648 12.24707 11.32055 12.85832
## 8 12.49719 0.2313128 168 12.46234 12.53205 11.71178 13.19561
## 9 12.79327 0.2035213 43 12.72437 12.86216 12.38422 13.32393
## 10 12.92131 0.4034591 18 12.81482 13.02779 11.98293 13.53447
## Q25 Q50 Q75
## 1 10.68889 10.79880 10.90872
## 2 10.73702 11.00210 11.00210
## 3 11.23162 11.36490 11.48665
## 4 11.38509 11.58989 11.74006
## 5 11.67844 11.79810 11.89819
## 6 11.84313 11.98293 12.10625
## 7 12.09776 12.20678 12.34908
## 8 12.36545 12.50525 12.63216
## 9 12.67288 12.75130 12.87310
## 10 12.76224 12.97697 13.06656
##
## $comparison
## NULL
##
## $groups
## log(SalePrice) groups
## 10 12.92131 a
## 9 12.79327 b
## 8 12.49719 c
## 7 12.22177 d
## 6 11.96731 e
## 5 11.78066 f
## 4 11.55715 g
## 3 11.33747 h
## 2 10.82538 i
## 1 10.79880 i
##
## attr(,"class")
## [1] "group"
# Scheffes Test
df = scheffe.test(CRD.OverallQual, "OverallQual", group = TRUE, console = TRUE)$groups
##
## Study: CRD.OverallQual ~ "OverallQual"
##
## Scheffe Test for log(SalePrice)
##
## Mean Square Error : 0.05304432
##
## OverallQual, means
##
## log.SalePrice. std r Min Max
## 1 10.79880 0.3108790 2 10.57898 11.01863
## 2 10.82538 0.3060823 3 10.47195 11.00210
## 3 11.33747 0.3067162 20 10.54271 11.84654
## 4 11.55715 0.2802953 116 10.46024 12.45293
## 5 11.78066 0.2110016 397 10.93298 12.34126
## 6 11.96731 0.2298875 374 11.23849 12.53177
## 7 12.22177 0.2108982 319 11.32055 12.85832
## 8 12.49719 0.2313128 168 11.71178 13.19561
## 9 12.79327 0.2035213 43 12.38422 13.32393
## 10 12.92131 0.4034591 18 11.98293 13.53447
##
## Alpha: 0.05 ; DF Error: 1458
## Critical Value of F: 1.886289
##
## Groups according to probability of means differences and alpha level( 0.05 )
##
## Means with the same letter are not significantly different.
##
## log(SalePrice) groups
## 10 12.92131 a
## 9 12.79327 a
## 8 12.49719 b
## 7 12.22177 c
## 6 11.96731 d
## 5 11.78066 e
## 4 11.55715 f
## 3 11.33747 fg
## 2 10.82538 g
## 1 10.79880 g
| log(SalePrice) | groups |
|---|---|
| 12.85736 | a |
| 12.49719 | b |
| 12.22177 | c |
| 11.96731 | d |
| 11.78066 | e |
| 11.52484 | f |
| 10.81475 | g |
There are three different measures that all relate to the age of a home: YrSold, YearBuilt, andYearRemodAdd. We wanted to combine these into a single Age variable. It should be noted that if a house did not have a renovation then YearRemodAdd is equal to YearBuilt.
Our initial estimate for this value was: \[ Age = Y_{sold} - \frac{Y_{built} + Y_{renov}}{2} \]
Which, can be rewritten as: \[ Age = Y_{sold} - w \cdot Y_{built} + (1-w) \cdot Y_{renov} \]
We then wanted to optimize the value of \(w\) so we built a small linear regression model and ran a grid search of \(w\) values using 10-Fold Cross Validation. After a number of iterations we found an optimized value of \(w=0.58\) which minimized the RMSE of the 10-Fold Cross Validation.
results = data.frame() # create an empty dataframe to store the result of the grid search
age_seq = seq(0.57,0.59,by=0.001) # final sequence of values to search through
old_age = 30
for(w in age_seq) {
# create a temporary data frame with the new formula for age
temp.data = house.clean %>% mutate(Age = YrSold - (w * YearBuilt + (1-w) * YearRemodAdd),
Old = Age >= old_age)
# set up the cross validation
set.seed(123)
train.control = trainControl(method = "cv", number = 10)
# Train the model
model = train(log(SalePrice) ~ GrLivArea + Age:Old, data = temp.data,
method = "lm",
trControl = train.control)
# save the results
results = rbind(results, cbind(w=w, model$results))
}
# display the results of the grid search
results
## w intercept RMSE Rsquared MAE RMSESD RsquaredSD
## 1 0.570 TRUE 0.2115273 0.7255814 0.1529785 0.02305752 0.05631114
## 2 0.571 TRUE 0.2115279 0.7255836 0.1529698 0.02306409 0.05632792
## 3 0.572 TRUE 0.2115270 0.7255957 0.1529396 0.02307069 0.05635035
## 4 0.573 TRUE 0.2115278 0.7255972 0.1529313 0.02307722 0.05636720
## 5 0.574 TRUE 0.2115288 0.7255982 0.1529232 0.02308373 0.05638409
## 6 0.575 TRUE 0.2115496 0.7255575 0.1529215 0.02310546 0.05640636
## 7 0.576 TRUE 0.2115509 0.7255576 0.1529140 0.02311191 0.05642315
## 8 0.577 TRUE 0.2115524 0.7255572 0.1529070 0.02311834 0.05643997
## 9 0.578 TRUE 0.2115541 0.7255565 0.1529003 0.02312476 0.05645682
## 10 0.579 TRUE 0.2115559 0.7255553 0.1528935 0.02313115 0.05647369
## 11 0.580 TRUE 0.2115732 0.7255053 0.1529106 0.02311000 0.05646077
## 12 0.581 TRUE 0.2115753 0.7255035 0.1529038 0.02311643 0.05647781
## 13 0.582 TRUE 0.2115798 0.7254900 0.1529158 0.02312298 0.05650494
## 14 0.583 TRUE 0.2115821 0.7254874 0.1529093 0.02312935 0.05652196
## 15 0.584 TRUE 0.2115671 0.7255435 0.1528804 0.02314697 0.05660164
## 16 0.585 TRUE 0.2115696 0.7255403 0.1528747 0.02315332 0.05661875
## 17 0.586 TRUE 0.2115724 0.7255367 0.1528693 0.02315964 0.05663589
## 18 0.587 TRUE 0.2116236 0.7253551 0.1529105 0.02310208 0.05665449
## 19 0.588 TRUE 0.2116265 0.7253508 0.1529045 0.02310846 0.05667181
## 20 0.589 TRUE 0.2116296 0.7253462 0.1528985 0.02311483 0.05668915
## 21 0.590 TRUE 0.2116329 0.7253412 0.1528925 0.02312117 0.05670652
## MAESD
## 1 0.008881293
## 2 0.008883075
## 3 0.008865666
## 4 0.008867364
## 5 0.008868974
## 6 0.008900377
## 7 0.008902413
## 8 0.008904215
## 9 0.008905922
## 10 0.008907649
## 11 0.008873899
## 12 0.008875582
## 13 0.008860559
## 14 0.008862365
## 15 0.008849782
## 16 0.008852370
## 17 0.008855259
## 18 0.008826186
## 19 0.008828753
## 20 0.008831339
## 21 0.008833944
# and now we can find the optimal weight for YearBuilt
results %>% filter(RMSE == min(RMSE))
## w intercept RMSE Rsquared MAE RMSESD RsquaredSD
## 1 0.572 TRUE 0.211527 0.7255957 0.1529396 0.02307069 0.05635035
## MAESD
## 1 0.008865666
w_age = 0.579
new_age = 5
old_age = 30
ggplot(house.clean, aes(x=(YrSold - (w_age * YearBuilt + (1-w_age) * YearRemodAdd)),
y=SalePrice), color=SalePrice) +
geom_point() +
geom_smooth(formula= y~x) +
scale_y_continuous(labels = scales::dollar) +
xlab("Age")
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
YrSold - (w_age * YearBuilt + (1-w_age) * YearRemodAdd)
There were a number of variables that were connected and in an effort to reduce similar variables in our feature engineering process, we created the following formula to convert YearBuilt, YearRemodAdd, W_age. Combining these variables returned the linear regression as seen above.
We know from everyday life that the neighborhood that you live in is related to the price of your house, but we suspected many neighborhoods to share similar characteristics. We analyzed this in some depth as follows.
ggplot(house.clean, aes(y = log(SalePrice), x = factor(Neighborhood), color = factor(Neighborhood))) +
geom_boxplot() +
geom_point(position = position_jitter(width = 0.2, height = 0.1), alpha = 0.2) +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
xlab("Neighbourhood") +
ylab("Sale Price (log)")
ggplot(house.clean, aes(y = log(SalePrice), x = reorder(factor(Neighborhood), log(SalePrice), FUN = median), color = factor(Neighborhood))) +
geom_boxplot() +
geom_point(position = position_jitter(width = 0.2, height = 0.1), alpha = 0.2) +
theme(axis.text.x = element_text(angle = 90, hjust = 1), legend.position = "None") +
xlab("Neighbourhood") +
ylab("Sale Price (log)")
| log(SalePrice) | Scheffe | TeamHacksaw | |
|---|---|---|---|
| NoRidge | 12.68 | a | a |
| NridgHt | 12.62 | a | a |
| StoneBr | 12.59 | ab | a |
| Timber | 12.36 | abc | b |
| Veenker | 12.34 | abcd | b |
| Somerst | 12.30 | bcd | b |
| ClearCr | 12.24 | bcd | c |
| Crawfor | 12.21 | bcd | c |
| Blmngtn | 12.17 | bcde | d |
| CollgCr | 12.16 | cde | d |
| Gilbert | 12.16 | cde | d |
| NWAmes | 12.13 | cde | d |
| SawyerW | 12.09 | cde | e |
| Mitchel | 11.93 | def | f |
| NAmes | 11.87 | defg | f |
| NPkVill | 11.87 | defgh | f |
| SWISU | 11.84 | defgh | f |
| Blueste | 11.83 | defgh | f |
| Sawyer | 11.81 | efgh | f |
| Edwards | 11.71 | fgh | g |
| OldTown | 11.70 | fgh | g |
| BrkSide | 11.68 | fgh | g |
| BrDale | 11.55 | fgh | h |
| MeadowV | 11.47 | gh | h |
| IDOTRR | 11.45 | h | h |
house.clean %>%
group_by(Alley) %>%
ggplot(aes(x = Alley, y = SalePrice)) +
geom_boxplot() +
geom_point(position = 'jitter', alpha = 0.2) +
scale_y_continuous(labels = scales::dollar)
When we reviewed the Alley, the chart shows that the variable doesn’t contribute to the selling price as the majority of the data falls in the “None” category and the distribution of SalePrice is not significantly different between categories.
house.clean %>%
group_by(BldgType) %>%
ggplot(aes(x = BldgType, y = SalePrice)) +
geom_boxplot() +
geom_point(position = 'jitter', alpha = 0.2) +
scale_y_continuous(labels = scales::dollar)
When we reviewed the BldgType, the chart shows that the variable doesn’t contribute to the selling price as the majority of the data falls in the “1Fam” category and the distribution of SalePrice is not significantly different between categories.
house.clean %>%
group_by(BsmtExposure) %>%
ggplot(aes(x = BsmtExposure, y = SalePrice)) +
geom_boxplot() +
geom_point(position = 'jitter', alpha = 0.2) +
scale_y_continuous(labels = scales::dollar)
When we reviewed the BsmtExposure, the chart shows that the variable doesn’t contribute to the selling price as the majority of the data falls in the “No” category and the distribution of SalePrice is not significantly different between categories.
house.clean %>%
group_by(HeatingQC) %>%
ggplot(aes(x = HeatingQC, y = SalePrice)) +
geom_boxplot() +
geom_point(position = 'jitter', alpha = 0.2) +
scale_y_continuous(labels = scales::dollar)
When we reviewed the HeatingQC, the chart shows that the variable doesn’t contribute to the selling price as the distribution of SalePrice is not significantly different between categories.
house.clean %>%
group_by(LotShape) %>%
ggplot(aes(x = LotShape, y = SalePrice)) +
geom_boxplot() +
geom_point(position = 'jitter', alpha = 0.2) +
scale_y_continuous(labels = scales::dollar)
When we reviewed the LotShape, the chart shows that the variable doesn’t contribute to the selling price as the distribution of SalePrice is not significantly different between categories.
house.clean %>%
filter(PoolArea > 0) %>%
ggplot(aes(x = PoolArea, y = SalePrice)) +
geom_point() +
geom_smooth(method = 'lm') +
scale_y_continuous(labels = scales::dollar)
When we reviewed the Pool Area, the chart shows that the variable doesn’t contribute to the selling price as there are very few houses sold with a pool in the dataset. Additionally, there is no relationship between Pool Area and SalePrice.
house.clean %>%
group_by(Fence) %>%
ggplot(aes(x = Fence, y = SalePrice)) +
geom_boxplot() +
geom_point(position = 'jitter', alpha = 0.2) +
scale_y_continuous(labels = scales::dollar)
When we reviewed the BsmtExposure, the chart shows that the variable doesn’t contribute to the selling price as the majority of the data falls in the “None” category and the distribution of SalePrice is not significantly different between categories.
house.clean %>%
filter(MiscVal > 0, MiscVal < 2500) %>%
ggplot(aes(x = MiscVal, y = SalePrice)) +
geom_point() +
scale_y_continuous(labels = scales::dollar) +
geom_smooth(method = 'lm')
When investigating the MiscVal variable, there is no distinct correlation between sale price and the misc values people assign the property.
house.clean %>%
group_by(Electrical) %>%
ggplot(aes(x = Electrical, y = SalePrice)) +
geom_boxplot() +
geom_point(position = 'jitter', alpha = 0.2) +
scale_y_continuous(labels = scales::dollar)
When we reviewed the Electrical variable, the chart shows that the variable doesn’t contribute to the selling price as the majority of the data falls in the “SBrkr” category and the distribution of SalePrice is not significantly different between categories.
house.clean %>%
mutate(Screen = ScreenPorch > 0) %>%
ggplot(aes(x = Screen, y = SalePrice)) +
geom_boxplot() +
geom_point(position = 'jitter', alpha = 0.2) +
scale_y_continuous(labels = scales::dollar)
When we reviewed the ScreenPorch variable, the chart shows that the variable doesn’t contribute to the selling price as the majority of the data falls in the “FALSE” category and the distribution of SalePrice is not significantly different between categories.
house.clean %>%
ggplot(aes(x = as.factor(BedroomAbvGr), y = SalePrice)) +
geom_boxplot() +
geom_point(position = 'jitter', alpha = 0.2) +
scale_y_continuous(labels = scales::dollar)
When we reviewed the BedroomAbvGr variable, the chart shows that the variable doesn’t contribute to the selling price as the majority of the data falls in the “3” category and the distribution of SalePrice is not significantly different between categories.
house.clean %>%
ggplot(aes(x = reorder(HouseStyle, SalePrice), y = SalePrice)) +
geom_boxplot() +
geom_point(position = 'jitter', alpha = 0.2) +
scale_y_continuous(labels = scales::dollar)
When we reviewed the HouseStyle variable, the chart shows that the variable doesn’t contribute to the selling price as the majority of the data falls in the “1Story” and “2Story” categories and the distribution of SalePrice is not significantly different between categories.
The variables of BsmtFinSF1, BsmtFinSF2 and BsmtUnfSF were removed because they are redundant. The information is also a part of the BasementFin which is the square footage of the basement.
The variables of BsmtFullBath and BsmtHalfBath were merged together because they are redundant.
The variable of Exterior2nd was removed because this is redundant. The information is also a part of the Exterior.
These variables were removed because they were redundant. All of these variables are merged into 1 variable.
Condition1: Proximity to various conditions Artery Adjacent to arterial street Feedr Adjacent to feeder street
Norm Normal
RRNn Within 200’ of North-South Railroad RRAn Adjacent to North-South Railroad PosN Near positive off-site feature–park, greenbelt, etc. PosA Adjacent to postive off-site feature RRNe Within 200’ of East-West Railroad RRAe Adjacent to East-West Railroad
#Condition1
ggplot(house.clean, aes( x = factor(Condition1))) +
geom_histogram(stat="count")
p2 = ggplot(house.feature, aes(x=factor(Functional))) +
geom_histogram(stat="count")
p3 = ggplot(house.feature, aes(x=factor(BathBelow))) +
geom_histogram(stat="count")
p4 = ggplot(house.feature, aes(x=factor(BasementCond))) +
geom_histogram(stat="count")
p5 = ggplot(house.feature, aes(x=factor(GasFurnace))) +
geom_histogram(stat="count")
p6 = ggplot(house.feature, aes(x=factor(ExteriorCovering))) +
geom_histogram(stat="count")
p7 = ggplot(house.clean, aes(x=factor(Condition1))) +
geom_histogram(stat="count")
p8 = ggplot(house.clean, aes(x=factor(Condition2))) +
geom_histogram(stat="count")
p9 = ggplot(house.clean, aes(x=factor(KitchenAbvGr))) +
geom_histogram(stat="count")
p10 = ggplot(house.clean, aes(x=factor(LandContour))) +
geom_histogram(stat="count")
p11 = ggplot(house.clean, aes(x=factor(LandSlope))) +
geom_histogram(stat="count")
p12 = ggplot(house.clean, aes(x=factor(LotConfig))) +
geom_histogram(stat="count")
p13 = ggplot(house.clean, aes(x=factor(PoolQC))) +
geom_histogram(stat="count")
p14 = ggplot(house.clean, aes(x=factor(RoofMatl))) +
geom_histogram(stat="count")
p15 = ggplot(house.clean, aes(x=factor(RoofStyle))) +
geom_histogram(stat="count")
p16 = ggplot(house.clean, aes(x=factor(Street))) +
geom_histogram(stat="count")
p17 = ggplot(house.clean, aes(x=factor(Utilities))) +
geom_histogram(stat="count")
plot_grid(p2, p3, p4, p5, p6, p7, p8, ncol=2)
plot_grid(p9, p10, p11, p12,p13,p14, p15, p16, p17, ncol=2)
These histograms show there isn’t as wide enough spread in the distribution so we decided to remove the following variables:
When considering considering the Condition1 variable, there was not enough variance to consider as part of our model.
Since our data is naturally stratified, and stratified sampling is always a more precise estimation than simple random sampling, we are using stratified sampling.
To further justify our use of stratified sampling we have created an ANOVA table which shows that the variation between our strata is greater than the variation within our strata.
summary(aov(log(house.feature$SalePrice) ~ house.feature$Neighborhood))
## Df Sum Sq Mean Sq F value Pr(>F)
## house.feature$Neighborhood 5 131.6 26.32 377.7 <2e-16 ***
## Residuals 1452 101.2 0.07
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Cook, D. (2011, June). Ames, Iowa: Alternative to Boston Housing Data as an End of Semester Regression Project[Online]. Available at: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data (Retrieved January 20, 2020)